{targets} 🎯

An introduction to the package {targets}

Jake Tufts

{targets} is a pipeline tool

Use it to coordinate your data analysis projects

🧑‍💻

Why use {targets}?

🎯

“{targets} implicitly nudges users toward a clean, function-oriented programming style that fits the intent of the R language”

Data analysis can be slow and repetitive

[Diagram: a cycle — Launch the code → Wait while it runs → Discover an issue → Restart from scratch → back to launch]


How does {targets} combat this?



  1. Skips unchanged code/steps 🏃
  2. Evidence that code = results 🔬
  3. Produces code dependency graphs 🦾
  4. Makes parallel computing easy 🔀

Code walkthrough



If you want to follow along in your local R environment, you can clone the repository at this link:


https://github.com/JT-39/targets-coffee-code-walkthrough

Setup



install.packages("targets")

Create an R project



usethis::create_project(path = r"(C:\Users\jtufts\Documents\targets-demo)")


Tip

Quick plug: you can use my package to create a pre-populated R project directory!

dauRtemplate::dau_proj_template(path = r"(C:\Users\jtufts\Documents\targets-demo)")


More information here: https://github.com/JT-39/dau-R-template-ext

Develop your data pipeline



The project directory could look something like…


├── analysis.R
├── _data/
│   └── raw_data.csv
├── R/
│   └── functions.R

Absence data


analysis.R


source(here::here("src/R/functions.R"))

# Path to absence data
absence_data_file_path <- here::here("_data/raw/1_absence_3term_nat_reg_la.csv")

# Extract national absence and format date
df_nat_absence <- get_nat_absence_data(absence_data_file_path) |>
  format_time_period()

# Fit a linear model
model <- fit_model(df_nat_absence)

# Plot the data and model
plot_model(model, df_nat_absence)

functions.R


# Pull national absence from file
get_nat_absence_data <- function(file_path) {
  read.csv(file = file_path) |>
    dplyr::filter(geographic_level == "National")
}

# Extract the start year from academic year
extract_year <- function(date) {
  paste0(substr(date, 1, 4))
}

# Format the year as a date
format_time_period <- function(data) {
  data |>
    dplyr::mutate(Date = lubridate::year(as.Date(extract_year(time_period),
                                                 format = "%Y")),
                  .after=time_period)
}


# Fit the model and pull coefficients
fit_model <- function(data) {
  lm(sess_overall_percent_pa_10_exact ~ Date, data) |>
    coefficients()
}

# Round up to the next multiple of five (used to pad the axis limit)
round_to_multiple_five <- function(x) {
  ceiling((x + 1) / 5) * 5
}

# Plot the data and model
plot_model <- function(model, data) {
  ggplot2::ggplot(data) +
    ggplot2::geom_point(ggplot2::aes(x = Date,
                                     y = sess_overall_percent_pa_10_exact,
                                     colour = school_type)) +
    ggplot2::geom_line(ggplot2::aes(x = Date,
                                     y = sess_overall_percent_pa_10_exact,
                                     colour = school_type)) +
    ggplot2::scale_colour_manual(values = kasstylesr::color_picker(4),
                                 breaks = c("Total", "State-funded primary",
                                            "State-funded secondary", "Special")) +
    ggplot2::geom_abline(intercept = model[1],
                         slope = model[2],
                         show.legend = TRUE,
                         colour = "red",
                         linetype = "dashed") +
    ggplot2::annotate("text",
                      x = max(data$Date),
                      y = lm(sess_overall_percent_pa_10_exact ~ Date,
                             data) |>
                        fitted.values() |>
                        max(),
                      hjust = -0.45,
                      label = "Line of best fit",
                      colour = "red") +
    ggplot2::scale_y_continuous(breaks = scales::pretty_breaks(),
                                limits = function(x) {
                                  c(0, round_to_multiple_five(max(x)))
                                }) +
    ggplot2::coord_cartesian(clip = 'off') +
    ggplot2::theme_minimal() +
    kasstylesr::kas_style() +
    ggplot2::labs(
      title = "Average persistent absence over time in England",
      subtitle = "Split by school type. Only includes persistent absentees.",
      x = "",
      y = "Overall absence rate (%)",
      colour = "School type:"
    )
}
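As a quick sanity check, the two small helpers above behave like this on toy inputs (redefined here so the snippet runs standalone):

```r
# Extract the start year from an academic-year label such as "201819"
extract_year <- function(date) {
  substr(date, 1, 4)
}

# Round up to the next multiple of five (used to pad the y-axis limit)
round_to_multiple_five <- function(x) {
  ceiling((x + 1) / 5) * 5
}

extract_year("201819")        # "2018"
round_to_multiple_five(12.3)  # 15
```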

Now make a {targets} pipeline



├── _targets_analysis.R*
├── _data/
│   └── raw_data.csv
├── R/
│   └── functions.R

Create a {targets} template


targets::tar_script()


There are three parts to the pipeline:

  1. Packages are loaded
library(targets)
  2. Pipeline-specific options are set and functions are sourced
tar_option_set(packages = "utils")
tar_source(here::here("R/functions.R"))
  3. The pipeline itself, a series of targets
list(
  tar_target(name = data,
             command = data.frame(x = sample.int(100),
                                  y = sample.int(100))),
  tar_target(name = data_summary,
             command = summarize_data(data))
)

The {targets} pipeline



  • It is simply a list of targets 🎯
  • A target is defined using tar_target()
  • Each has two required arguments:
    • name
    • command ~ the R expression that generates the target

The {targets} pipeline


So… a target defined as:

tar_target(name = y, command = f(x))


Can be understood as:

y <- f(x)


  • Later targets can use the outputs of earlier targets
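Chaining targets this way, the earlier absence analysis could be expressed as a pipeline. This is a sketch only: the target names mirror those in the `tar_make()` console output later in the deck, but the option settings are illustrative.

```r
library(targets)

# Illustrative: list the packages the functions actually use
tar_option_set(packages = c("dplyr", "lubridate", "ggplot2"))
tar_source(here::here("R/functions.R"))

list(
  tar_target(name = data_file,
             command = here::here("_data/raw/1_absence_3term_nat_reg_la.csv"),
             format = "file"),
  tar_target(name = nat_data,
             command = get_nat_absence_data(data_file)),
  tar_target(name = nat_data_clean,
             command = format_time_period(nat_data)),
  tar_target(name = model,
             command = fit_model(nat_data_clean)),
  tar_target(name = plot,
             command = plot_model(model, nat_data_clean))
)
```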

The {targets} pipeline


Build the pipeline



tar_make(script = "src/_targets_analysis.R")
> dispatched target data_file
o completed target data_file [1.69 seconds]
> dispatched target nat_data
o completed target nat_data [0.45 seconds]
> dispatched target nat_data_clean
o completed target nat_data_clean [0 seconds]
> dispatched target model
o completed target model [0 seconds]
Saving 7 x 7 in image
> dispatched target plot
o completed target plot [7.14 seconds]
> ended pipeline [9.67 seconds]

An empty global environment?


What is going on?



  • {targets} creates a pipeline of pure functions
  • Running the pipeline does not depend on the global environment 🌍
  • It also does not change anything outside its scope 🔭
  • The pipeline is PURE
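To see what "pure" means here, a toy base-R illustration (these functions are not from the walkthrough):

```r
counter <- 0

# Impure: reads and writes state outside its own scope,
# so the result depends on how many times it has been called
impure_increment <- function(x) {
  counter <<- counter + 1
  x + counter
}

# Pure: the result depends only on the inputs
pure_add <- function(x, n) {
  x + n
}

impure_increment(10)  # 11 on the first call, 12 on the second
pure_add(10, 1)       # always 11
```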

_targets/



  • The outputs are stored in _targets/objects/
  • Each target is saved as an object in the .rds format
  • So, you MUST add _targets/ to .gitignore (for GitHub) ☝️

_targets/


_targets/
├── meta
│   ├── meta
│   ├── process
│   └── progress
├── objects
│   ├── model
│   ├── nat_data
│   └── nat_data_clean
└── user


_targets/


To display targets:

tar_read(model)
 (Intercept)         Date 
-301.9190440    0.1597813 

_targets/


To load targets into the global environment:

tar_load(model)


Handling files


Advanced

  • First, create a target for the data’s file path (with format = "file")
targets::tar_target(
    name = data,
    command = here::here("_data/raw/1_absence_3term_nat_reg_la.csv"),
    format = "file"
)
  • Then use the file path target to read in the data
targets::tar_target(
    name = nat_data,
    command = read.csv(data)
)
  • Otherwise, {targets} will not track changes to the data file

File Big Brother



  • {targets} keeps track of changes in files and functions 🕵️
  • Any change marks the affected targets as out-of-date
  • Those targets, and everything downstream of them, are then re-computed
  • We can visualise this…

Code dependency graph


targets::tar_visnetwork(script = "src/_targets_analysis.R")

Let’s change something



  • We now filter out the years 2020 & 2021
format_time_period <- function(data) {
  data |>
    ... |>
    dplyr::filter(!Date %in% c(2020, 2021))
}
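The slide uses dplyr::filter(); on toy data, the base-R equivalent drops exactly those two years (the data frame below is illustrative, not the real absence data):

```r
df <- data.frame(Date = 2018:2022,
                 rate = c(11, 10, 24, 23, 12))

# Keep only rows whose Date is not 2020 or 2021
df_filtered <- df[!df$Date %in% c(2020, 2021), ]

df_filtered$Date  # 2018 2019 2022
```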

Let’s change something


targets::tar_visnetwork(script = "src/_targets_analysis.R")

These can be embedded graphs…


graph LR
style Legend fill:#FFFFFF00,stroke:#000000
style Graph fill:#FFFFFF00,stroke:#000000;
  subgraph Legend
    direction LR
    xf1522833a4d242c5([Up to date]):::uptodate --- xd03d7c7dd2ddda2b([Stem]):::none
    xd03d7c7dd2ddda2b([Stem]):::none --- xeb2d7cac8a1ce544>Function]:::none
  end
  subgraph Graph
    direction LR
    xb1fbb690b4ec8c10>extract_year]:::uptodate --> xd20a83ce47f3194c>format_time_period]:::uptodate
    xf7d598eca7911241>round_to_multiple_five]:::uptodate --> xec203b5a68d60f72>plot_model]:::uptodate
    xd20a83ce47f3194c>format_time_period]:::uptodate --> xc2980a3d74445b80([nat_data_clean]):::uptodate
    x83c942fcaf37c3dc([nat_data]):::uptodate --> xc2980a3d74445b80([nat_data_clean]):::uptodate
    x0d01c84c9424364d([data_file]):::uptodate --> x83c942fcaf37c3dc([nat_data]):::uptodate
    x9242f8c59a209716>get_nat_absence_data]:::uptodate --> x83c942fcaf37c3dc([nat_data]):::uptodate
    x9043e9d6bef6a839([model]):::uptodate --> x667cd56a75e2bb2b([plot]):::uptodate
    xc2980a3d74445b80([nat_data_clean]):::uptodate --> x667cd56a75e2bb2b([plot]):::uptodate
    xec203b5a68d60f72>plot_model]:::uptodate --> x667cd56a75e2bb2b([plot]):::uptodate
    x12e88730e39644dc>fit_model]:::uptodate --> x9043e9d6bef6a839([model]):::uptodate
    xc2980a3d74445b80([nat_data_clean]):::uptodate --> x9043e9d6bef6a839([model]):::uptodate
  end
classDef uptodate stroke:#000000,color:#ffffff,fill:#354823;
classDef none stroke:#000000,color:#000000,fill:#94a4ac;
linkStyle 0 stroke-width:0px;
linkStyle 1 stroke-width:0px;

… or save them as a .html


htmltools::save_html(html = targets::tar_visnetwork("src/_targets_analysis.R"),
                     file = "_outputs/code_pipeline.html")

Parallel computation


  • Can run any independent steps (targets) in parallel 🔀
  • We can harness any spare CPU cores available, cutting computation time ✂️
  • {targets} knows which parts of the pipeline can be run in parallel
  • To set this up…

Parallel computation


Need to load a few more packages:

library(targets)
library(future)
library(future.callr)
plan(callr)

Utilise the {targets} function to run in parallel:

# Set workers = 2 to use 2 cpu cores
targets::tar_make_future(workers = 2)

Simple as that! 💥

Other features

🔮

RMarkdown & Quarto


  • Can include the rendering of a .Rmd or .qmd in the pipeline
  • Utilises the power of the {targets} computation 🔋


Load package needed

library(tarchetypes)

Function to render the Quarto file in the pipeline

tar_quarto(
  my_doc,
  "my_document.qmd"
)

How to load targets in the Quarto file

```{r}
tar_read(plot)
```

SQL

  • When executing SQL, you will need to connect and disconnect each time 🔌
function(yml_key) {
  conn <- connect_sql_db(yml_key)
  on.exit(DBI::dbDisconnect(conn, shutdown = TRUE))

  DBI::dbGetQuery(...)
}
  • There is also the package {sqltargets}, which applies {targets} principles to .sql files 📦
-- !preview conn=DBI::dbConnect(RSQLite::SQLite())
-- tar_load(query_params)
select id
from table
where age > {age_threshold}
tar_sql(query, path = "query.sql", query_params = query_params)

Key takeaways



  • Enforces modular, function-based pipelines ✅
    (R best practice!)
  • Tracks changes to datasets and functions 🔍
  • Provides efficient computation of pipelines ♻️

We’ve hit the target!

🎯

Resources, links and email


  • {targets} manual: Link

  • YouTube {targets} walkthrough: Link

  • Ofsted MI {targets} example pipeline GitHub: Link

  • These slides and mini {targets} example GitHub: Link

  • {sqltargets} GitHub: Link

  • Building reproducible analytical pipelines with R: Link

Email me at:

jake.tufts@education.gov.uk

Example of a larger pipeline


Ofsted MI {targets} example pipeline GitHub

Process map

Targets pipeline